Goto

Collaborating Authors

 ridge regularization




Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

Paik, Seunghoon, Zhou, Kangjie, Telgarsky, Matus, Tibshirani, Ryan J.

arXiv.org Machine Learning

We introduce \textit{basic inequalities} for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. While related inequalities appear in the literature, we isolate and highlight a specific form and develop it as a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound translates the number of iterations into an effective regularization coefficient in the loss function. We demonstrate this framework through analyses of training dynamics and prediction risk bounds. In addition to revisiting and refining known results on gradient descent, we provide new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.


Generalized equivalences between subsampling and ridge regularization

Neural Information Processing Systems

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels $\lambda$ and subsample aspect ratios $\psi$, are asymptotically equivalent along specific paths in the $(\lambda,\psi)$-plane (where $\psi$ is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of $(\lambda,\psi)$. An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.


Interpolation can hurt robust generalization even when there is no noise

Neural Information Processing Systems

Numerous recent works show that overparameterization implicitly reduces variance for min-norm interpolators and max-margin classifiers. These findings suggest that ridge regularization has vanishing benefits in high dimensions. We challenge this narrative by showing that, even in the absence of noise, avoiding interpolation through ridge regularization can significantly improve generalization. We prove this phenomenon for the robust risk of both linear regression and classification, and hence provide the first theoretical result on \emph{robust overfitting}.




Generalized equivalences between subsampling and ridge regularization

Neural Information Processing Systems

We establish precise structural and risk equivalences between subsampling and ridge regularization for ensemble ridge estimators. Specifically, we prove that linear and quadratic functionals of subsample ridge estimators, when fitted with different ridge regularization levels \lambda and subsample aspect ratios \psi, are asymptotically equivalent along specific paths in the (\lambda,\psi) -plane (where \psi is the ratio of the feature dimension to the subsample size). Our results only require bounded moment assumptions on feature and response distributions and allow for arbitrary joint distributions. Furthermore, we provide a data-dependent method to determine the equivalent paths of (\lambda,\psi) . An indirect implication of our equivalences is that optimally tuned ridge regression exhibits a monotonic prediction risk in the data aspect ratio. This resolves a recent open problem raised by Nakkiran et al. for general data distributions under proportional asymptotics, assuming a mild regularity condition that maintains regression hardness through linearized signal-to-noise ratios.


Interpolation can hurt robust generalization even when there is no noise

Neural Information Processing Systems

Numerous recent works show that overparameterization implicitly reduces variance for min-norm interpolators and max-margin classifiers. These findings suggest that ridge regularization has vanishing benefits in high dimensions. We challenge this narrative by showing that, even in the absence of noise, avoiding interpolation through ridge regularization can significantly improve generalization. We prove this phenomenon for the robust risk of both linear regression and classification, and hence provide the first theoretical result on \emph{robust overfitting}.


Online and Offline Robust Multivariate Linear Regression

Godichon-Baggioni, Antoine, Robin, Stephane S., Sansonnet, Laure

arXiv.org Machine Learning

We consider the robust estimation of the parameters of multivariate Gaussian linear regression models. To this aim we consider robust version of the usual (Mahalanobis) least-square criterion, with or without Ridge regularization. We introduce two methods each considered contrast: (i) online stochastic gradient descent algorithms and their averaged versions and (ii) offline fix-point algorithms. Under weak assumptions, we prove the asymptotic normality of the resulting estimates. Because the variance matrix of the noise is usually unknown, we propose to plug a robust estimate of it in the Mahalanobis-based stochastic gradient descent algorithms. We show, on synthetic data, the dramatic gain in terms of robustness of the proposed estimates as compared to the classical least-square ones. Well also show the computational efficiency of the online versions of the proposed algorithms. All the proposed algorithms are implemented in the R package RobRegression available on CRAN.